Text Mining

Text Mining is a semi-automated process of knowledge discovery in textual data. The goal is to move from an unorganized state to an organized state.

There are two types:

  • Information Retrieval (extracting relevant documents based on known facts)
  • Knowledge Discovery for extracting previously unknown information.

Use-cases: Web Mining, Sentiment Analysis, …

Distinction from NLP

NLP | Text Mining
understanding text and speech | finding patterns
often uses deep learning and tools such as spaCy (an open-source library for natural language processing in Python) | uses feature hashing and distance measures together with machine-learning methods
also deals with audio | refers to text only
can translate | can only reveal relationships (easier to evaluate)
much broader range of applications than spam detection or sentiment analysis | certain NLP processes can be part of text-mining preprocessing

Workflow

  • Problem definition
  • defining the text you use as data source
  • Organize text: e.g. chronologically or by author
  • Feature Extraction: sentiment or word tokens into matrices
  • Analysis
  • Reach insight or produce output (predictive modeling)

Process

A Term Document Matrix (TDM) and a Document Term Matrix (DTM) are transposes of each other.
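The relationship can be sketched in Python (an illustrative toy example, not the R workflow used later in these notes): a DTM has one row per document and one column per term, and transposing it yields the TDM.

```python
from collections import Counter

# Two toy documents (made-up example data).
docs = ["the cow jumps over the moon", "the moon is bright"]

# Shared vocabulary: these become the columns of the DTM.
vocab = sorted({w for d in docs for w in d.split()})

# Document-Term Matrix: one row per document, one column per term.
dtm = [[Counter(d.split())[t] for t in vocab] for d in docs]

# Term-Document Matrix: simply the transpose of the DTM.
tdm = [list(row) for row in zip(*dtm)]

print(vocab)
print(dtm)
print(tdm)
```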

Two General Methods

Method 1: Semantic parsing

Both the type of the words and their order are important. A word can have three associated features:

  • is part of a sentence,
  • is part of a phrase
  • and is a type of word.

Method 2: Bag of words

Treats every word as a single token, regardless of its type or its order in the sentence.
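A minimal bag-of-words sketch in Python (illustrative; the sentence is a made-up example): word type and order are discarded, only token counts remain.

```python
from collections import Counter

def bag_of_words(sentence):
    # Lowercase and split on whitespace; every word is just a token,
    # regardless of its type or position in the sentence.
    return Counter(sentence.lower().split())

bow = bag_of_words("The cow jumps over the moon")
print(bow)  # word order is lost, only counts survive
```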

Word Embeddings

(German: Worteinbettungen)

Analyze how words are embedded in text: in what context do words appear? Word embedding is thus a technique for identifying similarities between words by using a model to predict the co-occurrence of words. By examining the adjacency of words in a model-created space, the models can complete analogies such as “Man is to woman as king is to queen.”

Word embeddings are created by identifying the words that occur within a “context window” (= words before and after a center word). These words will be used to train a word embedding model. Center word + context words = vector of numbers for presence or absence of unique words within a dataset.
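How a context window is formed can be sketched as follows (Python; the window size and sentence are made-up illustrations, not part of any particular embedding library):

```python
def context_windows(tokens, window=2):
    # For each center word, collect up to `window` words before and after it.
    pairs = []
    for i, center in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        pairs.append((center, context))
    return pairs

tokens = "the cow jumps over the moon".split()
for center, context in context_windows(tokens):
    print(center, "->", context)
```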

Two types of embedding models

Continuous Bag of Words (CBOW) Model | Skip-Gram Model
reads in the context window words and tries to predict the most likely center word | predicts the context words given the center word
useful in practical applications such as predictive web search | most useful for identifying patterns within texts to represent them in multidimensional space
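The difference between the two models shows up in how training examples are formed from the same context windows; a sketch (Python, purely illustrative, not word2vec itself):

```python
def training_examples(tokens, window=1):
    cbow, skipgram = [], []
    for i, center in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        # CBOW: the context words are the input, the center word is the target.
        cbow.append((context, center))
        # Skip-gram: the center word is the input, each context word a target.
        skipgram.extend((center, c) for c in context)
    return cbow, skipgram

cbow, skipgram = training_examples("he kindly stopped for me".split())
print(cbow[1])      # (['he', 'stopped'], 'kindly')
print(skipgram[:2]) # [('he', 'kindly'), ('kindly', 'he')]
```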

Tidytext

Uses the one-token-per-row principle: a token is a meaningful unit of text (paragraphs, n-grams, words, subwords). This principle distinguishes tidytext from other approaches:

  • String: usually the initial form
  • Corpus: raw strings annotated with metadata
  • Document-term matrix: a sparse (not dense) matrix that describes the corpus numerically, e.g. word counts or tf-idf
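The tf-idf weighting mentioned above can be sketched in one common variant, tf × log(N/df) (Python; the documents are made up, and tidytext's bind_tf_idf may use a slightly different normalization):

```python
import math

# Toy corpus: three tokenized documents (hypothetical example data).
docs = [["the", "cow", "jumps"], ["the", "moon"], ["the", "cow", "sleeps"]]

def tf_idf(term, doc, docs):
    tf = doc.count(term) / len(doc)          # term frequency within the document
    df = sum(term in d for d in docs)        # number of documents containing the term
    idf = math.log(len(docs) / df)           # inverse document frequency
    return tf * idf

print(tf_idf("the", docs[0], docs))   # 0.0: "the" appears in every document
print(tf_idf("moon", docs[1], docs))  # positive: "moon" is distinctive for doc 2
```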
library(janeaustenr)
library(tidyr)
library(dplyr)
library(tidytext)
library(stringr)
text <- c("Because I could not stop for Death -",
          "He kindly stopped for me -",
          "The Carriage held but just Ourselves -",
          "and Immortality")
text_df <- tibble(line = 1:4, text = text)
text_df %>%
  unnest_tokens(word, text)

N-gram modeling is an alternative to word embedding techniques. An n-gram is a sequence of n successive items in a text document, which may include words, numbers, symbols, and punctuation. It is used to convert text from an unstructured format to a structured format.

For example, take the sentence “The cow jumps over the moon”. If N=2 (known as bigrams), the n-grams would be:

  • the cow
  • cow jumps
  • jumps over
  • over the
  • the moon

So you have 5 n-grams in this case. Notice that we moved from the->cow to cow->jumps to jumps->over, etc., essentially moving one word forward to generate the next bigram.

If N=3, the n-grams would be:

  • the cow jumps
  • cow jumps over
  • jumps over the
  • over the moon
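The sliding-window construction above can be sketched in Python (illustrative):

```python
def ngrams(text, n):
    # Slide a window of n words over the token list, one word at a time.
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "The cow jumps over the moon"
print(ngrams(sentence, 2))  # the 5 bigrams listed above
print(ngrams(sentence, 3))  # the 4 trigrams listed above
```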

Difference between the n-gram model and the skip-gram model (answer from ChatGPT):

While Skip-gram and N-gram models are both used in text mining and natural language processing, they are different in their approaches and purposes.

  • N-gram Model: This model predicts the occurrence of a word based on the probability of it appearing given the previous N words. For instance, in a bigram model (N=2), the probability of a word is predicted based on the previous word. It’s a form of probabilistic language model that calculates the probability of a word given its previous N words.

  • Skip-gram Model: Skip-gram is a type of word embedding model used to represent words in a continuous vector space. It’s part of the word2vec framework developed by Google. Unlike N-grams, the skip-gram model is not focused on predicting words sequentially. Instead, it tries to learn representations of words that are useful for predicting the context words within a certain window around the target word.

In summary, while both are used in text mining, N-gram models focus on predicting sequential words, whereas skip-gram models focus on learning distributed representations of words in a continuous vector space, often used for tasks like word similarity and language modeling.
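The n-gram model's idea of predicting a word from the previous N-1 words can be sketched as a maximum-likelihood bigram estimate (Python; the toy corpus is made up for illustration):

```python
from collections import Counter

corpus = "the cow jumps over the moon the cow sleeps".split()

# Count bigrams and the contexts (all words that have a successor).
bigrams = Counter(zip(corpus, corpus[1:]))
contexts = Counter(corpus[:-1])

def p_next(word, prev):
    # Maximum-likelihood estimate: P(word | prev) = count(prev, word) / count(prev)
    return bigrams[(prev, word)] / contexts[prev]

print(p_next("cow", "the"))   # "the" is followed by "cow" in 2 of its 3 occurrences
print(p_next("moon", "the"))
```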

Tokenization in tidytext

  • “words” (default): words only

  • “characters”: individual characters only

  • “character_shingles”: like n-grams, but with characters

  • “ngrams”: several words as one unit

    text_df %>% unnest_tokens(ngram, text, token = "ngrams", n = 2)

  • “skip_ngrams”

    text_df %>% unnest_tokens(skip_ngrams, text, token = "skip_ngrams", n = 2)

  • “regex”: splits on a regular expression

  • “tweets”: word tokenization that preserves usernames, hashtags, and URLs

  • “ptb” (Penn Treebank): “This tokenizer uses regular expressions to tokenize text similar to the tokenization used in the Penn Treebank. It assumes that text has already been split into sentences. The tokenizer does the following:

    • splits common English contractions, e.g. don’t > do n’t and they’ll > they ’ll;
    • handles punctuation characters as separate tokens;
    • splits commas and single quotes off from words, when they are followed by whitespace;
    • splits off periods that occur at the end of the sentence”

Summary

What happens here?

library(dplyr)
original_books <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]", ignore_case = TRUE)))) %>%
  ungroup()
original_books

austen_books() contains the 6 novels of Jane Austen as a tidy data frame with two columns: text, which contains the text of the novels divided into elements of up to about 70 characters each, and book, which contains the titles of the novels as a factor in order of publication.

Then two columns are added: the row number is saved as the line number, and the cumulative count of matches of the chapter regex as the chapter number.

What happens here?

library(tidytext)
tidy_books <- original_books %>%
  unnest_tokens(word, text)
tidy_books

Every word gets separated onto its own row, along with the book title, line number, and chapter number.

data(stop_words)
tidy_books <- tidy_books %>%
  anti_join(stop_words)
tidy_books %>%
  count(word, sort = TRUE)
library(ggplot2)
tidy_books %>%
  count(word, sort = TRUE) %>%
  filter(n > 600) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word)) +
  geom_col() +
  labs(y = NULL)

stop_words is a data frame from tidytext with 1149 English stop words (“about”, “above”, “after”, etc.).

anti_join filters rows based on the presence or absence of matches.

The remaining words are then counted, and those with more than 600 occurrences are plotted.
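The anti_join + count step has a straightforward Python analogue (a sketch; the stop-word list here is a tiny made-up subset, not tidytext's stop_words):

```python
from collections import Counter

stop_words = {"the", "and", "of", "to", "a"}   # tiny illustrative subset

words = "the visit to the park and the garden".split()

# anti_join: keep only rows (words) with NO match in the stop-word list.
kept = [w for w in words if w not in stop_words]

# count(word, sort = TRUE): tally the remaining words, sorted by frequency.
counts = Counter(kept).most_common()
print(counts)
```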

Word clouds

library(wordcloud)
set.seed(123)
tidy_books %>%
  count(word, sort = TRUE) %>%
  with(wordcloud(word, n, min.freq = 40, max.words = 100))

Reading in PDF documents

library(pdftools)
library(tidyverse)
library(tidytext)
all_pdfs <- list.files(pattern = "06_BusinessIintelligence_Text Mining_HS23.pdf$")
sw <- get_stopwords(language = "de", source = "snowball")
map_df(all_pdfs, ~ tibble(txt = pdf_text(.x), mustWork = F) %>%
         mutate(filename = .x) %>%
         unnest_tokens(word, txt)) %>%
  anti_join(stop_words) %>%
  anti_join(sw) %>%
  filter(!grepl("[0-9]", word)) -> mypdf
library(wordcloud2)
wordcloud2(size=.4,mypdf %>%
             count(word,sort = TRUE)
          )

Homework

a) Word embeddings and PCA

Find out what word embeddings are and describe in a few sentences what they are and what they have in common with PCA components. https://cbail.github.io/textasdata/word2vec/rmarkdown/word2vec.html

Solution:

Embedding models analyze the context of words in a sentence or text. A center word is chosen, and the words before and after this center are examined (the context window). The words in the context window can be represented as vectors. There are then two methods for creating word embeddings: the Continuous Bag of Words (CBOW) model and the Skip-Gram model.

  • With CBOW, the model tries to determine the center word from the words of the context window.

  • With Skip-Gram it is exactly the other way around: the context words are determined from the center word.

The CBOW model can presumably be understood like a PCA: the model tries to reduce the dimensions to a single word that is most strongly related to the input words, i.e. that embeds itself best among them.

b) skip_ngrams

Describe what skip_ngrams does in the unnest_tokens function and what it could be used for.

Solution: It goes through all words in order and first outputs each word on its own, then together with the following word, and then with the next-but-one word, leaving out the one in between (skip). So it repeatedly adds a word and then leaves it out again. For n=2 there are 3 results per word, for n=3 already 7, and for n=4 it is 15.

A possible benefit is that the detection of relationships is not hindered by filler words (and, but, etc.).
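The counts in the solution (3, 7, 15 results per word, i.e. 2^n - 1) can be checked with a small sketch of the skip-gram construction (Python; this assumes a maximum skip distance of 1, i.e. each step moves to the next or the next-but-one word, and ignores edge effects at the end of the text — tokenizers' exact semantics may differ):

```python
def skip_grams_from(tokens, i, n, max_skip=1):
    # All token sequences starting at position i with up to n words,
    # where each step may skip at most `max_skip` intermediate words.
    results = [[tokens[i]]]
    if n > 1:
        for step in range(1, max_skip + 2):        # step 1 = adjacent word
            if i + step < len(tokens):
                for tail in skip_grams_from(tokens, i + step, n - 1, max_skip):
                    results.append([tokens[i]] + tail)
    return results

tokens = "a b c d e f g h i".split()
for n in (2, 3, 4):
    print(n, len(skip_grams_from(tokens, 0, n)))  # 3, 7, 15
```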

c) Exercise: word clouds with several PDFs

Create a word cloud for at least 10 PDF documents in one of your folders with the help of the R script tm-sent.R in the materials.

all_pdfs <- list.files(pattern = "*.pdf$", recursive = TRUE) # with recursive, all subfolders get searched (in this case a folder with bwl course slides [16 pdf files])
sw <- get_stopwords(language = "de", source = "snowball")
weglassen <- c("école", "haute", "applied", "bern", "berner","bernoise","sciences", "university","spécialisée", "fachhochschule")

map_df(all_pdfs, ~ tibble(txt = pdf_text(.x), mustWork = F) %>%
         mutate(filename = .x) %>%
         unnest_tokens(word, txt)) %>%
  anti_join(stop_words) %>%
  anti_join(sw) %>%
  filter(!word %in% weglassen) %>%
  filter(!grepl("[0-9]", word)) -> mypdf
library(wordcloud2)
# wordcloud2(size=.4, mypdf %>% count(word, sort = TRUE)) # uncomment to generate the interactive wordcloud

# Below is a PNG of the generated word cloud, since only one wordcloud2 widget can be present per rendered rmarkdown HTML.

d) Sentiment analysis

library(textdata)
get_sentiments("afinn")
get_sentiments("bing")
jane_austen_sentiment <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(book, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>% 
  mutate(sentiment = positive - negative)
jane_austen_sentiment
library(ggplot2) 
ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) + 
  facet_wrap(~book, ncol = 2, scales = "free_x")
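The core of the scoring above, counting positive and negative lexicon hits per chunk of 80 lines and taking the difference, can be sketched independently of R (Python; the mini-lexicon and rows are made up, stand-ins for the bing lexicon and the tokenized novels):

```python
from collections import defaultdict

# Tiny made-up sentiment lexicon (stand-in for get_sentiments("bing")).
lexicon = {"kindly": "positive", "death": "negative", "immortality": "positive"}

rows = [  # (linenumber, word) pairs, as produced by unnest_tokens
    (1, "death"), (2, "kindly"), (3, "kindly"), (81, "death"),
]

scores = defaultdict(int)
for linenumber, word in rows:
    if word in lexicon:                    # inner_join: keep lexicon matches only
        index = linenumber // 80           # linenumber %/% 80 in R
        scores[index] += 1 if lexicon[word] == "positive" else -1

print(dict(scores))  # net sentiment (positive - negative) per 80-line chunk
```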

Sentiment analysis in medical context

From https://www.sciencedirect.com/science/article/pii/S0933365715000299?via%3Dihub (Sentiment analysis in medical settings: New opportunities and challenges, Artificial Intelligence in Medicine, Volume 64, Issue 1, 2015, Kerstin Denecke, Yihan Deng):

Entity | Possible sentiment values
Health status | improve, worsen
Medical condition | present, improve, worsen
Diagnosis | certain, uncertain, preliminary
Effect of a medical event | critical, non-critical
Medical procedure | positive or negative outcome, successful or unsuccessful
Medication | helpful, useless, serious adverse events

Exercise

Tokenize reviews.csv from the materials in the same way that original_books <- austen_books() was tokenized into tidy_books in the section “Text Mining with TidyText” of the R script tm-sent.R. Then apply the sentiment analysis from tm-sent.R to reviews.csv. How do you interpret the result?

library(tidyr)
library(dplyr)
library(tidytext)
library(stringr)
library(textdata)

original_reviews <- as_tibble(read.csv("./Materialien-20231126/reviews.csv"))

tidy_reviews <- original_reviews %>%
  unnest_tokens(word, sentence)
tidy_reviews
reviews_sentiment <- tidy_reviews %>%
  inner_join(get_sentiments("bing")) %>%
  count(sentence_index, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>% 
  mutate(sentiment = positive - negative)
reviews_sentiment
library(ggplot2) 
ggplot(reviews_sentiment, aes(sentence_index, sentiment, fill = sentence_index)) +
  geom_col(show.legend = FALSE) 

From Chris Bail, PhD, Duke University (https://cbail.github.io/textasdata/word2vec/rmarkdown/word2vec.html):

“So what are Word Embeddings Useful for? Unfortunately, many of us are still trying to work this out. They have obvious applications for industry (especially in the area of search and information retrieval), but it is less clear how they can be used for empirical studies.”